Laplace’s rule of succession in information geometry


Yann Ollivier 


When observing data $x_1, \ldots, x_t$ modelled by a probabilistic distribution
$p_\theta(x)$, the maximum likelihood (ML) estimator
$\theta^{\mathrm{ML}} = \arg\max_\theta \sum_{i=1}^{t} \ln p_\theta(x_i)$
cannot, in general, safely be used to predict $x_{t+1}$. For instance, for a
Bernoulli process, if only “tails” have been observed so far, the probability
of “heads” is estimated to be 0. Laplace’s famous “add-one” rule of succession
(e.g., [Grü07]) regularizes $\theta$ by adding 1 to the count of “heads” and of “tails”
in the observed sequence.

Bayesian estimators suffer less from this problem, as every value of $\theta$
contributes, to some extent, to the Bayesian prediction of $x_{t+1}$ knowing $x_{1:t}$.
However, their use can be limited by the need to integrate over parameter
space or to use Monte Carlo samples from the posterior distribution.

For Bernoulli distributions, Laplace’s rule is equivalent to using a uniform
prior on the Bernoulli parameter. The non-informative Jeffreys prior
on the Bernoulli parameter corresponds to Krichevsky and Trofimov’s
“add-one-half” rule [KT81]. Thus, in this case, some Bayesian predictors have a
simple implementation.
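The three Bernoulli predictors mentioned so far can be compared directly. The following is a minimal sketch; the function name and its arguments are hypothetical and only serve the illustration.

```python
from fractions import Fraction

def bernoulli_predictors(heads, tails):
    """Probability of "heads" at time t+1 under three predictors.

    heads, tails: counts observed so far (hypothetical helper arguments).
    """
    t = heads + tails
    ml = Fraction(heads, t) if t > 0 else None   # maximum likelihood
    laplace = Fraction(heads + 1, t + 2)         # uniform prior ("add-one")
    kt = Fraction(2 * heads + 1, 2 * (t + 1))    # Jeffreys prior ("add-one-half")
    return ml, laplace, kt

ml, laplace, kt = bernoulli_predictors(heads=0, tails=5)
print(ml, laplace, kt)  # 0 1/7 1/12
```

After five “tails” and no “heads”, the ML predictor assigns probability 0 to “heads”, while both regularized rules keep it positive.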

We claim (Theorem 1) that for exponential families,¹ Bayesian predictors
can be approximated by mixing the ML estimator with the sequential
normalized maximum likelihood (SNML) estimator from universal coding
theory [RSKM08, RR08], which is a fully canonical version of Laplace’s rule.
The weights of this mixture depend on the density of the desired Bayesian
prior with respect to the non-informative Jeffreys prior, and are equal to
1/2 for the Jeffreys prior, thus extending Krichevsky and Trofimov’s result.
The resulting mixture also approximates the “flattened” ML estimator from
[KGDR10].

Thus, it is possible to approximate Bayesian predictors without the cost
of integrating over $\theta$ or sampling from the posterior. The statements below
emphasize the special role of the Jeffreys prior and the Fisher information
metric. Moreover, the analysis reveals that the direction of the shift from
the ML predictor to Bayesian predictors is systematic and given by an intrinsic,
information-geometric vector field on statistical manifolds. This could
contribute to regularization procedures in statistical learning.

¹ For simplicity we only state the results with i.i.d. models. However, the ideas extend
to non-i.i.d. sequences with $p_\theta(x_{t+1} \mid x_{1:t})$ in an exponential family, e.g., Markov models.

1. Notation and statement. Let $p_\theta(x)$ be a family of distributions on
a variable $x$, smoothly parametrized by $\theta$. Let $x_1, \ldots, x_t$ be a sequence
of observations to be predicted online using $p_\theta$. The maximum likelihood
predictor is

$$p^{\mathrm{ML}}(x_{t+1} = y \mid x_{1:t}) := p_{\theta_t^{\mathrm{ML}}}(y), \qquad \theta_t^{\mathrm{ML}} := \arg\max_\theta \sum_{i=1}^{t} \ln p_\theta(x_i) \tag{1}$$

Bayesian predictors (e.g., Laplace’s rule) usually differ from $p^{\mathrm{ML}}$ at order $1/t$.

The sequential normalized maximum likelihood predictor [RSKM08, RR08]
uses, for each possible value $y$ of $x_{t+1}$, the parameter $\theta^{\mathrm{ML}+y}$ that would yield
the best probability if $y$ had already been observed. Since this increases the
probability of every $y$, it is necessary to renormalize. Define

$$\theta^{\mathrm{ML}+y} := \arg\max_\theta \Big\{ \ln p_\theta(y) + \sum_{i=1}^{t} \ln p_\theta(x_i) \Big\} \tag{2}$$

as the ML estimator when adding $y$ at position $t+1$. For each $y$ let

$$p^{\mathrm{SNML}}(x_{t+1} = y \mid x_{1:t}) := \frac{1}{Z}\, p_{\theta^{\mathrm{ML}+y}}(y) \tag{3}$$

be the SNML predictor for time $t+1$, where $Z$ is a normalizing constant.²

For Bernoulli distributions, $p^{\mathrm{SNML}}$ coincides with Laplace’s “add-one”
rule.³ For other distributions the two may differ: for instance, defining
Laplace’s rule for continuous-valued $x$ requires choosing a prior distribution
on $x$, whereas the SNML distribution is completely canonical.
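For the Bernoulli case, definitions (2) and (3) can be evaluated in closed form, which makes the coincidence with Laplace’s rule easy to check. The sketch below assumes counts of “heads” and “tails” as input; the function name is hypothetical.

```python
from fractions import Fraction

def snml_bernoulli(heads, tails):
    """SNML probabilities of (heads, tails) at time t+1, following (2)-(3)."""
    t = heads + tails
    # For each candidate y, refit the ML parameter with y appended,
    # then evaluate the refitted model at y itself.
    unnorm_heads = Fraction(heads + 1, t + 1)   # p_{theta^{ML+heads}}(heads)
    unnorm_tails = Fraction(tails + 1, t + 1)   # p_{theta^{ML+tails}}(tails)
    z = unnorm_heads + unnorm_tails             # normalizing constant Z
    return unnorm_heads / z, unnorm_tails / z

print(snml_bernoulli(3, 7))  # (Fraction(1, 3), Fraction(2, 3))
```

With 3 “heads” out of 10 observations, the result is (3 + 1)/(10 + 2) = 1/3 for “heads”, which is exactly the “add-one” rule.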

We claim that for exponential families, $\frac{1}{2}(p^{\mathrm{ML}} + p^{\mathrm{SNML}})$ is close to the
Bayesian predictor using the Jeffreys prior. This generalizes the
“add-one-half” rule.
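For Bernoulli distributions this claim can be checked numerically: the gap between $\frac{1}{2}(p^{\mathrm{ML}} + p^{\mathrm{SNML}})$ and the Krichevsky-Trofimov predictor should shrink like $1/t^2$. A minimal sketch, with a hypothetical helper name and an arbitrary choice of 30% “heads”:

```python
def mix_minus_kt(heads, t):
    """Gap between (p_ML + p_SNML) / 2 and the add-one-half predictor."""
    p_ml = heads / t
    p_snml = (heads + 1) / (t + 2)   # Laplace's rule = SNML for Bernoulli
    p_kt = (heads + 0.5) / (t + 1)   # Bayesian predictor with the Jeffreys prior
    return 0.5 * (p_ml + p_snml) - p_kt

for t in (10, 100, 1000):
    gap = mix_minus_kt(heads=3 * t // 10, t=t)
    print(t, gap, gap * t**2)        # the last column stays bounded
```

The rescaled gap in the last column stays bounded as $t$ grows, consistent with an $O(1/t^2)$ difference.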

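This extends to any Bayesian prior $\pi$ by using a weighted SNML predictor

$$p^{w\text{-SNML}}(y) := \frac{1}{Z}\, w(\theta^{\mathrm{ML}+y})\, p_{\theta^{\mathrm{ML}+y}}(y) \tag{4}$$

The weight $w(\theta)$ to be used for a given prior $\pi$ will depend on the ratio
between it and the Jeffreys prior. Recall that the latter is
$\pi^{\mathrm{Jeffreys}}(d\theta) := \sqrt{\det I(\theta)}\, d\theta$, where $I$ is the Fisher information matrix of the family $(p_\theta)$,

$$I(\theta) := -\mathbb{E}_{x \sim p_\theta}\, \partial_\theta^2 \ln p_\theta(x) \tag{5}$$

where $\partial_\theta^2$ stands for the Hessian matrix of a function of $\theta$.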

² This variant of SNML is SNML-1 in [RSKM08] and CNML-3 in [Grü07].

³ Note that we describe it in a different way. The usual presentation of Laplace’s rule
is to define $\theta^{\mathrm{Lap}} := \arg\max_\theta \{\ln p_\theta(\text{“heads”}) + \ln p_\theta(\text{“tails”}) + \sum_{i=1}^{t} \ln p_\theta(x_i)\}$ and then use
$\theta^{\mathrm{Lap}}$ to predict $x_{t+1}$. Here we follow the SNML viewpoint and use a different $\theta^{\mathrm{ML}+y}$ for
each possible value $y$ of $x_{t+1}$.

Theorem 1. Let $p_\theta$ be an exponential family of probability distributions,
and let $\pi$ be a Bayesian prior on $\theta$. Then, under suitable regularity assumptions,
the Bayesian predictor with prior $\pi$ knowing $x_{1:t}$ is equal to

$$\tfrac{1}{2}\, p^{\mathrm{ML}}(\cdot \mid x_{1:t}) + \tfrac{1}{2}\, p^{\beta^2\text{-SNML}}(\cdot \mid x_{1:t}) \tag{6}$$

up to $O(1/t^2)$, where $\beta(\theta)$ is the density of $\pi$ with respect to the Jeffreys
prior, i.e., $\pi(d\theta) = \beta(\theta)\sqrt{\det I(\theta)}\, d\theta$ with $I$ the Fisher matrix.

More precisely, both under the prior $\pi$ and under $\frac{1}{2}(p^{\mathrm{ML}} + p^{\beta^2\text{-SNML}})$, the
probability that $x_{t+1} = y$ given $x_{1:t}$ is asymptotically

$$p_{\theta_t^{\mathrm{ML}}}(y) \left( 1 + \frac{1}{2t}\, \|\partial_\theta \ln p_\theta(y)\|_F^2 + \frac{1}{t}\, \langle \partial_\theta \ln \beta,\ \partial_\theta \ln p_\theta(y) \rangle_F - \frac{\dim \Theta}{2t} + O(1/t^2) \right) \tag{7}$$

provided $p_{\theta_t^{\mathrm{ML}}}(y) > 0$, where $\langle \partial_\theta f, \partial_\theta g \rangle_F := (\partial_\theta f)^{T} I^{-1}(\theta)\, \partial_\theta g$ is the Fisher
scalar product and $\|\partial_\theta f\|_F^2 = \langle \partial_\theta f, \partial_\theta f \rangle_F$ is the Fisher metric norm of $\partial_\theta f$.

For the Jeffreys prior (constant $\beta$), this also coincides up to $O(1/t^2)$
with the “flattened” or “squashed” ML predictor from [KGDR10, GK10]
with $n_0 = 0$. In particular, the latter is $O(1/t^2)$ close to the Bayesian
predictor using the Jeffreys prior, and the optimal regret guarantees in
[KGDR10] apply to (7). Note that a multiplicative $1 + O(1/t^2)$ difference
between predictors results in an $O(1)$ difference on cumulated regrets.
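As an illustration of (7), consider a Bernoulli model in the mean parametrization $p$ with a Beta$(a, b)$ prior, for which the exact Bayesian predictor is $(n_1 + a)/(t + a + b)$ with $n_1$ the number of “heads”. The sketch below is an assumption-laden example rather than part of the theorem: the function names are hypothetical, and $\beta(p) \propto p^{a - 1/2}(1-p)^{b - 1/2}$ is the Beta density divided by the Jeffreys density.

```python
def theorem1_expansion(heads, t, a, b):
    """Right-hand side of (7), without the O(1/t^2) term, for y = "heads".

    Bernoulli model in the mean parametrization p, Beta(a, b) prior.
    Requires 0 < heads < t so that the ML estimate is interior.
    """
    p = heads / t
    fisher_inv = p * (1 - p)                  # I(p)^{-1} for Bernoulli
    score = 1 / p                             # d/dp ln p_p(heads)
    # d/dp ln beta, with beta proportional to p^(a-1/2) (1-p)^(b-1/2)
    dlog_beta = (a - 0.5) / p - (b - 0.5) / (1 - p)
    norm_sq = score * fisher_inv * score      # Fisher norm squared
    inner = dlog_beta * fisher_inv * score    # Fisher scalar product
    dim = 1                                   # dim Theta
    return p * (1 + norm_sq / (2 * t) + inner / t - dim / (2 * t))

def bayes_exact(heads, t, a, b):
    """Exact Bayesian probability of "heads" under a Beta(a, b) prior."""
    return (heads + a) / (t + a + b)

for t in (100, 1000):
    h = 3 * t // 10
    print(t, theorem1_expansion(h, t, 2, 3) - bayes_exact(h, t, 2, 3))
```

Multiplying $t$ by 10 divides the printed difference by roughly 100, as expected for an $O(1/t^2)$ remainder.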

Regularity assumptions. In most of the article we assume that $p_\theta(x_{t+1} \mid x_{1:t})$
is a non-degenerate exponential family of probability distributions. The key
property we need from exponential families is the existence of a parametrization
$\theta$ in which $\partial_\theta^2 \ln p_\theta(x) = -I(\theta)$ for all $x$ and $\theta$. For simplicity we assume
that the space for $x$ is compact, so that to prove $O(1/t^2)$ convergence of
distributions over $x$ it is enough to prove $O(1/t^2)$ convergence for each value
of $x$. We assume that the sequence of observations $(x_t)_{t \in \mathbb{N}}$ is an ineccsi
sequence [Grü07], namely, that for $t$ large enough, the maximum likelihood
estimate stays in a compact subset of the parameter space. The Bayesian
priors are assumed to be smooth with positive densities. In some parts of
the article we do not need $p_\theta$ to be an exponential family, but we still assume
that the model $p_\theta$ is smooth, that there is a well-defined maximum $\theta_t^{\mathrm{ML}}$ for
any $x_{1:t}$, and that there are no other local maxima of the log-likelihood.

2. Computing the SNML predictor. We prove Theorem 1 by proving 
that both predictors are given by (7). Further proofs are gathered at the 
end of the text. 

We first work on $p^{\mathrm{SNML}}$. Here we do not assume that $p_\theta$ is an exponential
family. Let $J_t$ be the observed information matrix, assumed to be positive-definite,

$$J_t(\theta) := -\frac{1}{t} \sum_{i=1}^{t} \partial_\theta^2 \ln p_\theta(x_i) \tag{8}$$

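Proposition 2. Under suitable regularity assumptions, the maximum
likelihood update from $t$ to $t+1$ satisfies

$$\theta_{t+1}^{\mathrm{ML}} = \theta_t^{\mathrm{ML}} + \frac{1}{t}\, J_t(\theta_t^{\mathrm{ML}})^{-1}\, \partial_\theta \ln p_\theta(x_{t+1}) + O(1/t^2) \tag{9}$$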

For exponential families, this update is the natural gradient of $\ln p(x_{t+1})$
with learning rate $1/t$ [Ama98], because $J_t(\theta_t^{\mathrm{ML}}) = I(\theta_t^{\mathrm{ML}})$, the exact Fisher
information matrix. (For exponential families in the natural parametrization,
$J_t(\theta) = I(\theta)$ for all $\theta$. But since the Hessian of a function $f$ on a manifold
is a well-defined tensor at a critical point of $f$, it follows that at $\theta_t^{\mathrm{ML}}$ one
has $J_t(\theta_t^{\mathrm{ML}}) = I(\theta_t^{\mathrm{ML}})$ for any parametrization of an exponential family.)
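The update (9) can be tried on a one-parameter example. The sketch below assumes an exponential distribution $p_\lambda(x) = \lambda e^{-\lambda x}$, which is not discussed in the text and is chosen only because its ML estimate $t / \sum_i x_i$ and Fisher information $1/\lambda^2$ are explicit; the function names are hypothetical.

```python
def ml_rate(total, t):
    """ML estimate of the rate of an exponential distribution from t samples."""
    return t / total

def natural_gradient_step(rate, x_new, t):
    """One step of (9): theta + (1/t) I(theta)^{-1} d/dtheta ln p_theta(x_new)."""
    fisher_inv = rate**2          # I(rate) = 1 / rate^2, equal to J_t here
    score = 1 / rate - x_new      # derivative of ln(rate) - rate * x in rate
    return rate + fisher_inv * score / t

for t in (100, 1000):
    total = t / 2.0               # samples summing to t/2, so the ML rate is 2
    exact = ml_rate(total + 1.0, t + 1)
    approx = natural_gradient_step(ml_rate(total, t), 1.0, t)
    print(t, exact - approx)      # shrinks roughly like 1/t^2
```

The exact refit and the natural gradient step agree up to a remainder that decreases by about a factor 100 when $t$ is multiplied by 10.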

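Proposition 3. Under suitable regularity assumptions,

$$p^{\mathrm{SNML}}(y \mid x_{1:t}) = \frac{1}{Z}\, p_{\theta_t^{\mathrm{ML}}}(y) \left( 1 + \frac{1}{t}\, (\partial_\theta \ln p_\theta(y))^{T} J_t^{-1}\, \partial_\theta \ln p_\theta(y) + O(1/t^2) \right) \tag{10}$$

provided $p_{\theta_t^{\mathrm{ML}}}(y) > 0$, where $J_t$ is as above and the derivatives are taken at
$\theta_t^{\mathrm{ML}}$.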

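Importantly, the normalization constant $Z$ can be computed without
having to sum over $y$ explicitly. Indeed (cf. [KGDR10]), by definition of
$I(\theta)$,

$$\mathbb{E}_{y \sim p_\theta}\, (\partial_\theta \ln p_\theta(y))^{T} J_t^{-1}\, \partial_\theta \ln p_\theta(y) = \operatorname{Tr}\big( J_t^{-1} I(\theta) \big) \tag{11}$$

so that $Z = 1 + \frac{1}{t} \operatorname{Tr}\big( J_t^{-1} I(\theta_t^{\mathrm{ML}}) \big) + O(1/t^2)$. For exponential families, $J_t = I$
at $\theta_t^{\mathrm{ML}}$, so that $Z = 1 + \frac{\dim \Theta}{t} + O(1/t^2)$ and

$$p_{\theta_t^{\mathrm{ML}}}(y) \left( 1 + \frac{1}{t}\, (\partial_\theta \ln p_\theta(y))^{T} I^{-1}\, \partial_\theta \ln p_\theta(y) - \frac{\dim \Theta}{t} \right) \tag{12}$$

is an $O(1/t^2)$ approximation of $p^{\mathrm{SNML}}(y \mid x_{1:t})$.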
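The estimate $Z = 1 + \dim\Theta / t + O(1/t^2)$ can be compared with an exact computation in a simple case. The sketch below assumes a categorical model with $k$ outcomes (so $\dim\Theta = k - 1$), where refitting with outcome $y$ appended assigns it probability $(n_y + 1)/(t + 1)$; the counts and the function name are hypothetical.

```python
from fractions import Fraction

def snml_normalizer(counts):
    """Exact SNML normalizing constant Z for a categorical model."""
    t = sum(counts)
    # Refitting with outcome y appended gives probability (n_y + 1) / (t + 1) to y.
    return sum(Fraction(n + 1, t + 1) for n in counts)

counts = [3, 5, 2]                       # hypothetical observed counts, t = 10
t, dim = sum(counts), len(counts) - 1    # dim Theta = k - 1 free parameters
z = snml_normalizer(counts)
print(z, 1 + Fraction(dim, t))           # 13/11 6/5
```

Here the exact value is $1 + (k-1)/(t+1)$, which differs from $1 + (k-1)/t$ by $(k-1)/(t(t+1))$, an $O(1/t^2)$ term.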

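For the weighted SNML distribution $p^{w\text{-SNML}}$, a similar argument yields

$$p^{w\text{-SNML}}(y \mid x_{1:t}) = \frac{1}{Z}\, p_{\theta_t^{\mathrm{ML}}}(y) \left( 1 + \frac{1}{t}\, (\partial_\theta \ln p_\theta(y))^{T} J_t^{-1} \big( \partial_\theta \ln p_\theta(y) + \partial_\theta \ln w(\theta) \big) + O(1/t^2) \right) \tag{13}$$

with $Z = 1 + \frac{1}{t} \operatorname{Tr}\big( J_t^{-1} I(\theta_t^{\mathrm{ML}}) \big) + O(1/t^2)$ as above. (The $\partial_\theta \ln w$ term does
not contribute to $Z$ because $\mathbb{E}_{y \sim p_\theta}\, \partial_\theta \ln p_\theta(y) = 0$.)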

Computing $\frac{1}{2} p^{\mathrm{ML}} + \frac{1}{2} p^{w\text{-SNML}}$ with $w(\theta) = \beta(\theta)^2$ in (13), and using that
$J_t(\theta_t^{\mathrm{ML}}) = I(\theta_t^{\mathrm{ML}})$ for exponential families, proves one half of Theorem 1.

3. Computing the Bayesian posterior. Next, let us establish the
asymptotic behavior of the Bayesian posterior. This relies on results from
[TK86]. The following proposition may have independent interest.

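Proposition 4. Consider a Bayesian prior $\pi(d\theta) = \alpha(\theta)\, d\theta$. Then the
posterior mean of a smooth function $f(\theta)$ given data $x_{1:t}$ and prior $\pi$ is
asymptotically

$$f(\theta_t^{\mathrm{ML}}) + \frac{1}{t}\, (\partial_\theta f)^{T} J_t^{-1}\, \partial_\theta \ln \frac{\alpha}{\sqrt{\det(-\partial_\theta^2 L)}} + \frac{1}{2t} \operatorname{Tr}\big( J_t^{-1} \partial_\theta^2 f \big) + O(1/t^2) \tag{14}$$

where $L(\theta) := \frac{1}{t} \ln p_\theta(x_{1:t})$ is the average log-likelihood function, $\partial_\theta^2$ is the
Hessian matrix w.r.t. $\theta$, and $J_t := -\partial_\theta^2 L(\theta_t^{\mathrm{ML}})$ is the observed information
matrix.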

When $p_\theta$ is an exponential family in the natural parametrization, for
any $x_{1:t}$, $-\partial_\theta^2 L$ is equal to the Fisher matrix $I$, so that the denominator in
the log is the Jeffreys prior $\sqrt{\det I}$. In particular, for exponential families
in natural coordinates, the first term vanishes if the prior $\pi$ is the Jeffreys
prior.
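As a sanity check of (14) in a parametrization that is not the natural one, consider a Bernoulli model in the mean parametrization $p$ with a Beta$(a, b)$ prior and $f(p) = p^2$, for which the exact posterior mean has a closed form. This is only a sketch under these assumptions, with hypothetical function names. In these coordinates $-\partial_p^2 L(p) = \hat p / p^2 + (1 - \hat p)/(1 - p)^2$ with $\hat p$ the ML estimate, whose log-derivative at $p = \hat p$ is $2(2\hat p - 1)/(\hat p (1 - \hat p))$.

```python
def posterior_mean_expansion(heads, t, a, b):
    """Evaluate (14), without the O(1/t^2) term, for f(p) = p**2."""
    p = heads / t                                   # ML estimate, assumed in (0, 1)
    j_inv = p * (1 - p)                             # inverse observed information at p
    dlog_alpha = (a - 1) / p - (b - 1) / (1 - p)    # d/dp ln of the Beta(a, b) density
    dlog_det = 2 * (2 * p - 1) / (p * (1 - p))      # d/dp ln(-L''(p)) at the ML estimate
    df, d2f = 2 * p, 2.0                            # derivatives of f(p) = p**2
    shift = df * j_inv * (dlog_alpha - 0.5 * dlog_det) / t
    spread = 0.5 * j_inv * d2f / t
    return p**2 + shift + spread

def posterior_mean_exact(heads, t, a, b):
    """Second moment of the Beta(heads + a, t - heads + b) posterior."""
    A, B = heads + a, t - heads + b
    return A * (A + 1) / ((A + B) * (A + B + 1))

for t in (100, 1000):
    h = 3 * t // 10
    print(t, posterior_mean_expansion(h, t, 2, 3) - posterior_mean_exact(h, t, 2, 3))
```

The two printed differences differ by a factor close to 100, in line with the $O(1/t^2)$ remainder in (14).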

Corollary 5. Let $p_\theta$ be an exponential family. Consider a Bayesian prior
$\beta(\theta)\sqrt{\det I(\theta)}\, d\theta$ having density $\beta$ with respect to the Jeffreys prior. Then
the posterior probability that $x_{t+1} = y$ knowing $x_{1:t}$ is asymptotically given
by (7) as in Theorem 1.

This proves the second half of Theorem 1. 


4. Intrinsic viewpoint. When rewritten in intrinsic Riemannian terms,
Proposition 4 emphasizes a systematic discrepancy at order $1/t$ between ML
prediction and Bayesian prediction, which is often more “centered” as in
Laplace’s rule.

This is characterized by a canonical vector field on a statistical manifold
indicating the direction of the difference between ML and Bayesian predictors,
as follows. In intrinsic terms, the posterior mean (14) in Proposition 4 is⁴

$$f(\theta^{\mathrm{ML}}) - \frac{1}{t}\, (\nabla^2 L)^{-1} \left( df,\ d \ln \frac{\pi}{\sqrt{\det(-\nabla^2 L)}} \right) - \frac{1}{2t} \operatorname{Tr}\big( (\nabla^2 L)^{-1} \nabla^2 f \big) + O(1/t^2) \tag{15}$$

where $L(\theta) = \frac{1}{t} \sum_{i=1}^{t} \ln p_\theta(x_i)$ as above and where $\nabla^2$ is the Riemannian Hessian
with respect to any Riemannian metric on $\Theta$, for instance the Fisher metric.
This follows from a direct Riemannian-geometric computation (e.g., in
normal coordinates). In this expression both the prior $\pi(d\theta)$ and $\sqrt{\det(-\nabla^2 L)}$
are volume forms on the tangent space, so that their ratio is
coordinate-independent.⁵

⁴ The equality between (14) and (15) holds only at $\theta^{\mathrm{ML}}$; the value of (14) is not intrinsic
away from $\theta^{\mathrm{ML}}$. The equality relies on $\partial_\theta L = 0$ at $\theta^{\mathrm{ML}}$ to cancel curvature contributions.

At first order in $1/t$, this is the average of $f$ under a Riemannian Gaussian
distribution⁶ with covariance matrix $\frac{1}{t}(-\nabla^2 L)^{-1}$, but centered at
$\theta^{\mathrm{ML}} - \frac{1}{t}(\nabla^2 L)^{-1}\, d \ln\big( \pi / \sqrt{\det(-\nabla^2 L)} \big)$ instead of $\theta^{\mathrm{ML}}$.

Thus, if we want to approximate the posterior Bayesian distribution by
a Gaussian, there is a systematic shift $\frac{1}{t} V(\theta^{\mathrm{ML}})$ between the ML estimate
and the center of the Bayesian posterior, where $V$ is the data-dependent
vector field

$$V := -(\nabla^2 L)^{-1}\, d \ln\Big( \pi / \sqrt{\det(-\nabla^2 L)} \Big) \tag{16}$$

A particular case is when $\pi$ is the Jeffreys prior: then

$$V = \tfrac{1}{2}\, (\nabla^2 L)^{-1}\, d \ln \det\big( -I^{-1} \nabla^2 L \big) \tag{17}$$

is an intrinsic vector field defined on any statistical manifold, depending on
$x_{1:t}$.


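Proposition 6. When the prior is the Jeffreys prior, the vector $V$ is

$$V^{i} = \tfrac{1}{2}\, (\nabla_i \nabla_j L)^{-1} (\nabla_k \nabla_l L)^{-1}\, \nabla_j \nabla_k \nabla_l L \tag{18}$$

in Einstein notation, where $L(\theta) = \frac{1}{t} \sum_{s=1}^{t} \ln p_\theta(x_s)$ is the log-likelihood
function, and $\nabla$ is the Levi-Civita connection of the Fisher metric.⁷

If $p_\theta$ is an exponential family with the Jeffreys prior, the value of $V$ at
$\theta^{\mathrm{ML}}$ does not depend on the observations $x_{1:t}$ and is equal to

$$V^{i}(\theta^{\mathrm{ML}}) = \tfrac{1}{4}\, I^{ij} I^{kl}\, T_{jkl} \tag{19}$$

where $T$ is the skewness tensor [AN00, Eq. (2.28)]

$$T_{jkl}(\theta) := \mathbb{E}_{x \sim p_\theta}\, \frac{\partial \ln p_\theta(x)}{\partial \theta^{j}}\, \frac{\partial \ln p_\theta(x)}{\partial \theta^{k}}\, \frac{\partial \ln p_\theta(x)}{\partial \theta^{l}} \tag{20}$$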


$V(\theta^{\mathrm{ML}})$ is thus an intrinsic, data-independent vector field for exponential
families, which characterizes the discrepancy between maximum likelihood
and the “center” of the Jeffreys posterior distribution. Note that $V$ can
be computed from log-likelihood derivatives only. This could be useful for
regularization of the ML estimator in statistical learning.
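For a concrete instance of (19) and (20), take a Bernoulli model in the natural (log-odds) parametrization, where the score of an outcome $x \in \{0, 1\}$ is $x - p$. The sketch below computes $I$ and $T$ by direct expectation and returns $V = T / (4 I^2)$; the function name is hypothetical and the one-dimensional setting is an assumption made for simplicity.

```python
def jeffreys_shift_bernoulli(p):
    """V = (1/4) I^{-2} T from (19)-(20), Bernoulli in the natural parameter."""
    # Outcomes and their probabilities; the score of x is x - p in log-odds coordinates.
    outcomes = ((1, p), (0, 1 - p))
    fisher = sum(w * (x - p) ** 2 for x, w in outcomes)      # I = p (1 - p)
    skewness = sum(w * (x - p) ** 3 for x, w in outcomes)    # T = p (1 - p) (1 - 2p)
    return skewness / (4 * fisher**2)

for p in (0.1, 0.3, 0.5, 0.9):
    print(p, jeffreys_shift_bernoulli(p))
```

The result equals $(1 - 2p)/(4p(1-p))$: it is positive for $p < 1/2$, zero at $p = 1/2$ and negative for $p > 1/2$, so the shift always points towards the balanced distribution.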

⁵ This is clear when dividing both by the Riemannian volume form $\sqrt{\det g}$: both the
prior density $\pi / \sqrt{\det g}$ and $\sqrt{\det(-g^{-1} \nabla^2 L)}$ are intrinsic.

⁶ i.e., the image by the exponential map of a Gaussian distribution in a tangent plane.

⁷ Note that $\nabla_j \nabla_k \nabla_l L$ is not fully symmetric. Still it is symmetric at $\theta^{\mathrm{ML}}$, because the
various orderings differ by a curvature term applied to $\nabla L$, which vanishes at $\theta^{\mathrm{ML}}$.

5. Proofs (sketch).

Proof of Proposition 2.

Minimization of a Taylor expansion of log-likelihood around $\theta_t^{\mathrm{ML}}$. This is
justified formally by applying the implicit function theorem to
$F : (\varepsilon, \theta) \mapsto \partial_\theta \big( \varepsilon \ln p_\theta(x_{t+1}) + \frac{1}{t} \sum_{i=1}^{t} \ln p_\theta(x_i) \big)$ at point $(0, \theta_t^{\mathrm{ML}})$. □

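Proof of Proposition 3.

Abbreviate $\theta^{y} := \theta_t^{\mathrm{ML}+y}$. From Proposition 2 we have

$$\theta^{y} = \theta_t^{\mathrm{ML}} + \frac{1}{t}\, J_t^{-1}\, \partial_\theta \ln p_\theta(y) + O(1/t^2) \tag{21}$$

and expanding $\ln p_\theta(y)$ around $\theta_t^{\mathrm{ML}}$ yields
$p_{\theta^{y}}(y) = p_{\theta_t^{\mathrm{ML}}}(y) \big( 1 + (\theta^{y} - \theta_t^{\mathrm{ML}})^{T} \partial_\theta \ln p_\theta(y) \big) + O\big( (\theta^{y} - \theta_t^{\mathrm{ML}})^2 \big)$,
and plugging in the value of $\theta^{y} - \theta_t^{\mathrm{ML}}$ yields the result. □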


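Proof of Proposition 4.

The posterior mean is $\big( \int f(\theta)\, \alpha(\theta)\, p_\theta(x_{1:t})\, d\theta \big) / \big( \int \alpha(\theta)\, p_\theta(x_{1:t})\, d\theta \big)$. From [TK86],
if $L_1(\theta) = \frac{1}{t} \ln p_\theta(x_{1:t}) + \frac{1}{t} g_1(\theta)$ and $L_2 = \frac{1}{t} \ln p_\theta(x_{1:t}) + \frac{1}{t} g_2(\theta)$ we have

$$\frac{\int e^{t L_2(\theta)}\, d\theta}{\int e^{t L_1(\theta)}\, d\theta} = \sqrt{\frac{\det H_1}{\det H_2}}\; e^{t (L_2(\theta_2) - L_1(\theta_1))}\, \big( 1 + O(1/t^2) \big) \tag{22}$$

where $\theta_1 = \arg\max L_1$, $\theta_2 = \arg\max L_2$, and $H_1$ and $H_2$ are the Hessian
matrices of $-L_1$ and $-L_2$ at $\theta_1$ and $\theta_2$, respectively. Here we have
$g_1 = \ln \alpha(\theta)$ and $g_2 = g_1 + \ln f(\theta)$ (assuming $f$ is positive; otherwise, add a
constant to $f$).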

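From a Taylor expansion of $L_1$ as in Proposition 2 we find
$\theta_1 = \theta_t^{\mathrm{ML}} + \frac{1}{t} J_t^{-1} \partial_\theta g_1(\theta_t^{\mathrm{ML}}) + O(1/t^2)$ and likewise for $\theta_2$. So
$\theta_1 - \theta_2 = \frac{1}{t} J_t^{-1} \partial_\theta (g_1 - g_2)(\theta_t^{\mathrm{ML}}) + O(1/t^2)$. Since $\theta_2$ maximizes $L_2$, a Taylor expansion of $L_2$ around
$\theta_2$ gives

$$L_2(\theta_1) = L_2(\theta_2) - \tfrac{1}{2}\, (\theta_1 - \theta_2)^{T} H_2\, (\theta_1 - \theta_2) + O(1/t^3) \tag{23}$$

so that, using $L_2 = L_1 + \frac{1}{t} \ln f$, we find

$$L_2(\theta_2) - L_1(\theta_1) = L_2(\theta_1) - L_1(\theta_1) + \tfrac{1}{2}\, (\theta_1 - \theta_2)^{T} H_2\, (\theta_1 - \theta_2) + O(1/t^3) \tag{24}$$

$$= \frac{1}{t} \ln f(\theta_1) + \frac{1}{2t^2}\, (\partial_\theta \ln f)^{T} J_t^{-1} H_2 J_t^{-1}\, \partial_\theta \ln f + O(1/t^3) \tag{25}$$

where the second term is evaluated at $\theta_t^{\mathrm{ML}}$. We have $H_2 = J_t + O(1/t)$,
so $\exp\big( t (L_2(\theta_2) - L_1(\theta_1)) \big) = f(\theta_1) \big( 1 + \frac{1}{2t} (\partial_\theta \ln f)^{T} J_t^{-1} \partial_\theta \ln f + O(1/t^2) \big)$.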
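Meanwhile, by a Taylor expansion of $\ln \det(-\partial_\theta^2 L_2(\theta_2))$ around $\theta_2$,

$$\det H_2 = \det(-\partial_\theta^2 L_2(\theta_2)) = \det(-\partial_\theta^2 L_2(\theta_1)) \Big( 1 + (\theta_2 - \theta_1)^{T} \partial_\theta \ln \det(-\partial_\theta^2 L_2) + O\big( \|\theta_2 - \theta_1\|^2 \big) \Big) \tag{26}$$

and from $L_2 = L_1 + \frac{1}{t} \ln f$ and $\det(A + \varepsilon B) = \det(A) \big( 1 + \varepsilon \operatorname{Tr}(A^{-1} B) + O(\varepsilon^2) \big)$,

$$\det(-\partial_\theta^2 L_2(\theta_1)) = \det(-\partial_\theta^2 L_1(\theta_1)) \left( 1 + \frac{1}{t} \operatorname{Tr}\big( (\partial_\theta^2 L_1)^{-1} \partial_\theta^2 (\ln f) \big) + O(1/t^2) \right) \tag{27}$$

$$= (\det H_1) \left( 1 - \frac{1}{t} \operatorname{Tr}\big( H_1^{-1} \partial_\theta^2 (\ln f) \big) + O(1/t^2) \right) \tag{28}$$

so, collecting,

$$\sqrt{\frac{\det H_1}{\det H_2}} = 1 - \tfrac{1}{2}\, (\theta_2 - \theta_1)^{T} \partial_\theta \ln \det(-\partial_\theta^2 L_2) + \frac{1}{2t} \operatorname{Tr}\big( H_1^{-1} \partial_\theta^2 (\ln f) \big) + O(1/t^2) \tag{29}$$

but $\theta_2 - \theta_1 = \frac{1}{t} J_t^{-1} \partial_\theta \ln f + O(1/t^2)$, and $L_2 = L + O(1/t)$ and $H_1 = J_t + O(1/t)$,
so that

$$\sqrt{\frac{\det H_1}{\det H_2}} = 1 - \frac{1}{2t}\, (\partial_\theta \ln f)^{T} J_t^{-1}\, \partial_\theta \ln \det(-\partial_\theta^2 L) + \frac{1}{2t} \operatorname{Tr}\big( J_t^{-1} \partial_\theta^2 (\ln f) \big) + O(1/t^2) \tag{30}$$

Collecting from (22), expanding $f(\theta_1) = f(\theta_t^{\mathrm{ML}}) \big( 1 + \frac{1}{t} (\partial_\theta \ln f)^{T} J_t^{-1} \partial_\theta \ln \alpha + O(1/t^2) \big)$,
and expanding $\partial_\theta^2 \ln f$ in terms of $\partial_\theta^2 f$ proves Proposition 4. □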

Proof of Corollary 5.

Let us work in natural coordinates for an exponential family (indeed, since
the statement is intrinsic, it is enough to prove it in some coordinate system).
In these coordinates, for any $x$, $\partial_\theta^2 \ln p_\theta(x) = -I(\theta)$ with $I$ the Fisher matrix,
so that $-\partial_\theta^2 L = I(\theta)$. Apply Proposition 4 to $f(\theta) = p_\theta(y)$, expanding
$\partial_\theta f = f\, \partial_\theta \ln f$ and using $\partial_\theta^2 \ln f = -I(\theta)$. □

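Proof of Proposition 6.

The Levi-Civita connection on a Riemannian manifold with metric $g$ satisfies
$\nabla_i \ln \det A^{j}_{k} = (A^{-1})^{k}_{j}\, \nabla_i A^{j}_{k}$ thanks to $\partial \ln \det M = \operatorname{Tr}(M^{-1} \partial M)$ and by
expanding $\nabla A$. Applying this to $A^{j}_{l} = I^{jk} \nabla_k \nabla_l L$ and using $\nabla I = 0$ proves
the first statement. Moreover, for any function $f$, at a critical point of $f$,
$\nabla_i \nabla_j \nabla_k f = \nabla_i \partial_j \partial_k f - \Gamma^{l}_{jk} \nabla_l \nabla_i f$ and consequently at a critical point of $f$,
with $H_{ij} = \nabla_i \nabla_j f$,

$$\nabla_i \ln \det\big( g^{ij} H_{jk} \big) = (H^{-1})^{jk}\, \nabla_i \partial_j \partial_k f - (H^{-1})^{jk}\, \Gamma^{l}_{jk}\, H_{li} \tag{31}$$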

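In the natural parametrization of an exponential family, $-\partial^2 L$ is identically
equal to the Fisher metric $I$. Consequently,
$\nabla_i \ln \det\big( -I^{jk} \nabla_k \nabla_l L \big) = I^{jk} \nabla_i I_{jk} - I^{jk} \Gamma^{l}_{jk} I_{li} = -I^{jk} \Gamma^{l}_{jk} I_{li}$ since $\nabla I = 0$. So from (17), using
$d = \nabla = \partial$ for scalars, and $\nabla^2 L = -I$ at $\theta^{\mathrm{ML}}$, we get in this parametrization

$$V^{m} = -\tfrac{1}{2}\, I^{ml}\, \partial_l \ln \det\big( -I^{-1} \nabla^2 L \big) = \tfrac{1}{2}\, I^{ml} I^{jk}\, \Gamma^{i}_{jk}\, I_{il} = \tfrac{1}{2}\, I^{jk}\, \Gamma^{m}_{jk} \tag{32}$$

The Christoffel symbols $\Gamma$ in this parametrization can be computed from

$$\partial_i I_{jk}(\theta) = \partial_i\, \mathbb{E}_{x \sim p_\theta}\, \partial_j \ln p_\theta(x)\, \partial_k \ln p_\theta(x) \tag{33}$$

$$= T_{ijk} - I_{ij}\, \mathbb{E}_{x \sim p_\theta}\, \partial_k \ln p_\theta(x) - I_{ik}\, \mathbb{E}_{x \sim p_\theta}\, \partial_j \ln p_\theta(x) = T_{ijk} \tag{34}$$

because $\partial_i \partial_j \ln p_\theta(x) = -I_{ij}(\theta)$ for any $x$ in this parametrization, and because
$\mathbb{E}_{x \sim p_\theta}\, \partial \ln p_\theta(x) = 0$. So $\Gamma^{i}_{jk} = \frac{1}{2} I^{il} T_{jkl}$ in this parametrization. This ends
the proof. □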

Acknowledgments. I would like to thank Peter Grünwald for valuable
comments.